ABSTRACT

Thanks to information extraction and semantic Web efforts, 
search on unstructured text is increasingly refined using se- 
mantic annotations and structured knowledge bases. How- 
ever, most users cannot become familiar with the schema 
of knowledge bases and ask structured queries. Interpret- 
ing free-format queries into a more structured representa- 
tion is of much current interest. The dominant paradigm 
is to segment or partition query tokens by purpose (refer- 
ences to types, entities, attribute names, attribute values, 
relations) and then launch the interpreted query on struc- 
tured knowledge bases. Given that structured knowledge 
extraction is never complete, here we use a data represen- 
tation that retains the unstructured text corpus, along with 
structured annotations (mentions of entities and relation- 
ships) on it. We propose two new, natural formulations 
for joint query interpretation and response ranking that ex- 
ploit bidirectional flow of information between the knowl- 
edge base and the corpus. One, inspired by probabilistic 
language models, computes expected response scores over 
the uncertainties of query interpretation. The other is based 
on max-margin discriminative learning, with latent variables 
representing those uncertainties. In the context of typed en- 
tity search, both formulations bridge a considerable part of 
the accuracy gap between a generic query that does not con- 
strain the type at all, and the upper bound where the "per- 
fect" target entity type of each query is provided by humans. 
Our formulations are also superior to a two-stage approach 
of first choosing a target type using recent query type predic- 
tion techniques, and then launching a type-restricted entity 
search query. 

1. INTRODUCTION 

Web information representation is getting more sophisti- 
cated, thanks to information extraction and semantic Web 
efforts. Much structured and semistructured data now sup- 
plements unstructured, free-format textual pages. In verti- 
cals such as e-commerce, the structured data can be accessed 
through forms and faceted search. However, a large number 
of free-format queries remain outside the scope of verticals. 
As we shall review in Section 2, there is much recent research
on analyzing and annotating them.

Here we focus on a specific kind of entity search query: 
Some words (called selectors) in the query are meant to oc- 
cur literally in a response document (as in traditional text 
search), but other words hint at the type of entity sought 
by the query. Unlike prior work on translating well-formed 
sentences or questions to structured queries using deep NLP, 
we are interested in handling "telegraphic" queries that are 
typically sent to search engines. Each response entity must
be a member of the hinted type. 

Note that this problem is quite different from finding an- 
swers to well-formed natural language questions (e.g., in 
Wolfram Alpha) from structured knowledge bases (perhaps 
curated through information extraction). Also observe that 
we do not restrict ourselves to queries that seek entities by 
attribute values or attributes of a given entity (both are 
valuable query templates for e-commerce and have been re- 
searched). In our setup, some responses may only be col- 
lected from diverse, open-domain, free-format text sources. 
E.g., typical driving time between Paris and Nice (the target 
type is time duration), or cricketers who scored centuries at 
Lords (the target type is cricketers). 

The target type (or a more general supertype, such as 
sportsperson in place of cricketer) may be instantiated in a 
catalog, but the typical user has no knowledge of the catalog 
or its schema. Large catalogs like Wikipedia or Freebase 
evolve "organically". They are not designed by linguists, and 
they are not minimal or canonical in any sense. Types have 
overlaps and redundancies. The query interpreter should 
take advantage of specialized types whenever available, but 
otherwise gracefully back off to broader types. 
[Figure 1 diagram: a query whose words are split into hints and
selectors; the entity San Diego Padres; and an evidence snippet
linked to that entity by mentionOf: "By comparison, the Padres
have been to two World Series, losing in 1984 and 1998."]

Figure 1: A partition of query words into hints and 
selectors, some partially matching types with mem- 
ber entities, and corpus snippets constitute a collec- 
tive, joint query interpretation and entity ranking 
problem. 

Figure 1 shows a query that has at least two plausible hint
word sets: {team, baseball} (correct) and {world, series} (in- 
correct). Hint words partially match descriptions of types in 
a catalog, which lead to member entities. Potential response 
entities are mentioned in document snippets (one shown), 
which in turn partially match selector words (world, series, 
losing, 1998). Given a limited number of types to choose 
from, a human will find it trivial to pick the best. However, 
a program will find it very challenging to decide which subset
of query words are type hints, and, even after that, to
select the best type(s) from a large type catalog. This query 
interpretation task is one part of our goal. 

We posit that corpus statistics provide critical signals for 
query interpretation. For example, we might benefit from 
knowing that San_Diego_Padres rarely co-occurs with the 
word "hockey", which can be known only from the corpus. 
Query interpretation should ideally be done jointly with 
ranking entities from the corpus, because it involves a del- 
icate combinatorial balance between the hint-selector split, 
and the (rather noisy) signals from the quality of matches be- 
tween type descriptions and hint words, snippets and other 
words, and mentions of entities in said snippets. 

Although query typing has been investigated before
[5], to the best of our knowledge this is the first work on
combining type interpretation with learning to rank [20]. In
Section 4 we present a natural, generative formulation for
the task using probabilistic language models. In Section 5
we present a more flexible and powerful max-margin
discriminative approach [18, 7].

In Section 6, we report on experiments involving 709 queries,
over 200,000 types, 1.5 million entities, and 380 million ev- 
idence snippets collected from over 500 million Web pages. 
The entity ranking accuracy of a reasonable query inter- 
preter will be between the "lower bound" of a generic system 
that makes no effort to identify the target type (i.e., all cat- 
alog entities are candidates), and the upper bound of an 
unrealistic "perfect" system that knows the target type by 
magic. Our salient experimental observations are: 

• The generative language model approach improves en- 
tity ranking accuracy significantly beyond the lower 
bound with respect to MAP, MRR and NDCG.

• The discriminative approach is superior to generative; 
e.g., it bridges 43% of the MAP gap between the lower 
and upper bounds. 

• In fact, if we discard the entity ranks output from our 
system, use it only as a target type predictor, and is- 
sue a query with the predicted type, entity ranking 
accuracy drops. 

• Our discriminative approach beats a recent target type 
prediction algorithm by significant margins. 

• NLP-heavy techniques are not robust to telegraphic 
queries. 

Our data and code will be made publicly available. 

2. RELATED WORK 

Interpreting a free-format query into a structured form has
been explored extensively in the information retrieval (IR)
and Web search communities, with several recent dedicated
workshops. A preliminary but critical structuring step is
to demarcate phrases [6] in free-format queries. There is
also a large literature on topic-independent intent discovery
[10, 11] as well as topic-dependent facet [29] or template [1]
inference.

The problem of disambiguating named entities mentioned 
in queries is superficially similar to ours, but is technically 
quite different. In Figure 2, the query word ymca may refer
to different entities, but the additional query word lyrics hints
at type music, whereas address hints at type organization.
Note that the query text directly embeds a mention of an
entity, not a type. Disambiguating the entity (usually) amounts
to disambiguating the type; contrast Figure 2 with Figure 1.
A given mention usually refers to only a few entities. In 
contrast, misinterpreting the hint often pollutes the entity 
response list beyond redemption. Delaying a hard choice of 
the target type, or avoiding it entirely, is likely to help. 



[Figure 2 diagram: the query "ymca lyrics" leads to the entity
YMCA_(song), linked by instanceOf to Type: Music; the query
"ymca address" leads to the entity YMCA_(org), linked by
instanceOf to Type: Organization.]



Figure 2: Disambiguating named entities in queries. 

For entity disambiguation, Guo et al. [14] proposed a
probabilistic language model through weak supervision that learns
to associate, e.g., lyrics with music and address with
organization. Pantel et al. [24, 25] pushed this further by exploiting
clicks and modeling intent. Hu et al. [16] addressed a similar
problem. None gave a discriminative max-margin formulation,
or unified the framework with learning to rank.

Given that the database community uses SQL and XQuery 
as unambiguous, structured representations of information 
needs, and that the NLP community seeks to parse sen- 
tences to a well-defined meaning, there also exists conver- 
gent database and NLP literature on interpreting free-format 
(source) queries into a suitable target "query language". Nat- 
urally, much of this work seeks to identify types, entities, 
attributes, and relations in queries. Although the theoretical
problem is challenging [13], a common underlying theme
is that each token in the query may be an expression of 
schema elements, entities, or relationships: this leads to a 
general assignment problem, which is solved approximately 
using various techniques, summarized below. 

Sarkas et al. [32] annotated e-commerce queries using schema 
and data in a structured product catalog. In the context of 
Web-extracted knowledge bases such as YAGO [33], Pound
et al. [28, 27] set up a collective assignment problem with a
cost model that reflects syntactic similarity between query 
fragments and their assigned concepts, as well as semantic 
coherence between concepts [19]. Sarkas, Pound and oth- 
ers, like us, handle "telegraphic queries" that may not be 
well-formed sentences. DEANNA [37] solved the collective 
assignment problem using an integer program. It is capa- 
ble of parsing queries as complex as "which director has 
won the Academy Award for best director and is married 
to an actress that has won the Academy Award for best ac- 
tress?" As might be expected, DEANNA is rather sensitive 
to query syntax and often fails on telegraphic queries. All 
these systems interpret the query with the help of a fairly 
clean, structured knowledge base. 
None of them give discriminative learning-to-rank algorithms that
jointly disambiguate the query and rank responses. IBM's Watson
[23] identifies candidate entities first, and then scores them
for compatibility with likely target types.

In this work, we do not assume that a knowledge base 
has been curated ahead of time from a text corpus. Instead 

     Natural language query                                          Telegraphic form
Q1   Woodrow Wilson was president of which university?               woodrow wilson president university
Q2   Which Chinese cities have many international companies?         Chinese city many international companies
Q3   What cathedral is in Claude Monet's paintings?                  cathedral claude monet paintings
Q4   Along the banks of what river is the Hermitage Museum located?  hermitage museum banks of
Q5   At what institute was Dolly cloned?                             dolly clone institute
Q6   Who made the first airplane?                                    first airplane inventor



Figure 3: Natural language queries and typical 
telegraphic forms, with potential type description 
matches underlined. 

we assume entities and types have been annotated on spans 
of unstructured text. Accordingly, we step back from so- 
phisticated target schemata, settling for three basic relations 
(instanceOf, subTypeOf, and mentionOf; see Figure 1) that
link a structured entity catalog with an unstructured text 
corpus (such as the Web). On the other hand, we take the 
first step toward integrating learning-to-rank [20] techniques 
with query interpretation. 

Closest to our goal are those of Vallet and Zaragoza [36] 
and Balog and Neumayer (B&N) [5]. Vallet and Zaragoza
first collected a ranked list of entities by launching a query 
without any type constraints. Each entity belongs to a hi- 
erarchy of types. They accrued a score in favor of a type 
from every entity as a function of its rank, and ranked types 
by decreasing total score. B&N investigated two techniques. 
In the first, descriptions of all entities e belonging to each 
type t were concatenated into a super document for t, and 
turned into a language model. In the second (similar in 
spirit to Vallet and Zaragoza), the score of t was calculated 
as a weighted average of probabilities of entity description 
language models generating the query, for $e \in t$.

These approaches [5, 30] use long entity descriptions, such
as found on the Wikipedia page representing an entity, but
not a corpus where entity mentions are annotated. The
corpus documents may well not be definitional, and yet they
remarkably improve entity ranking accuracy, as we shall see. None
of [36, 5, 30] attempt a segmentation of query words by
purpose (target type vs. literal matches).

3. BACKGROUND AND NOTATION 

3.1 "Telegraphic" queries 

A "telegraphic" entity search query q expresses an
information need that is satisfied by one or more entities. Query
q is a sequence of $|q|$ words. The j-th word of query q is
denoted $w_{q,j}$, where $j = 1, \ldots, |q|$, and the subscript q in
$w_{q,j}$ is omitted if clear from context. We will use q
interchangeably as a query identifier and to denote its
sequence of words. Unlike full, well-formed, grammatical
sentences or questions, telegraphic queries resemble short
Web search queries having no clear subject-verb-object or
other complex clausal structure. Some examples of natural
telegraphic entity search queries and possible natural
language "translations" are shown in Figure 3. $Q$ denotes a set
of queries.

3.2 The entity and type catalog 



The catalog $(\mathcal{T}, \mathcal{E}, \subseteq^+, \in^+)$ is a directed acyclic graph of
type nodes $t \in \mathcal{T}$, with edges representing the "is-subtype-of"
transitive binary relation $\subseteq^+$. Each type t is described
by one or more lemmas (descriptive phrases) $L(t)$, e.g.,
Austrian physicists.

Each entity e in the catalog is also represented by a node
connected by "is-instance-of" edge(s) to one or more most
specific type nodes, and transitively belongs to all
supertypes; this relation is represented as $\in^+$. An entity e may
be a candidate for a query q. The set of candidate entities
for query q is called $\mathcal{E}_q \subseteq \mathcal{E}$. In training data, an entity e
may be labeled relevant (denoted $e_+$) or irrelevant (denoted
$e_-$) for q. $\mathcal{E}_q$ is accordingly partitioned into $\mathcal{E}_q^+$ and $\mathcal{E}_q^-$.

3.3 Annotated corpus and snippets 

The corpus is a set of free-format text documents. Each 
document is modeled as a sequence of words. Entity e is 
mentioned at some places in an unstructured text corpus.
A "mention" is a token span (e.g., Big Apple) that gives
evidence of reference to e (e.g., New_York_City). The
mention span, together with a suitable window of context words
around it, is called a snippet. The set of snippets mentioning
e is called $S_e$; $c \in S_e$ is one snippet context supporting e.

In the Wikipedia corpus, most mentions are annotated 
manually as wiki hyperlinks. For Web text, statistical learning
techniques [19, 15] are used for high-quality annotations.
Here we assume mentions to be correct and deterministic. 
Extending our work to noisy mentions is left for future work. 

4. GENERATIVE FORMULATION 

Given the success of generative techniques in corpus modeling
[8], IR [39] and entity ranking [3, 4], it is natural to
propose a generative language model approach to joint query 
interpretation and response ranking. 

As is common in generative language models, we will fix 
an entity e and generate the query words, by taking the 
following steps: 

1. Choose a type from $\{t : e \in^+ t\}$;

2. Describe that type using one or more query words, 
which will be called hint words; 

3. Collect snippets that mention e; and 

4. Generate the remainder of the query by sampling words 
from these snippets. 

Our goal is to rank entities by probability given the query, 
by taking the expectation over possible types and hints. 

4.1 Choosing a type given e 

Given entity e, we first pick a type t such that $e \in^+ t$,
and describe t in the query (with the expectation that the 
system will infer t, then instantiate it to e as a response). 
So the basic question looks like: "if the answer is Albert 
Einstein, what type (among scientist, person, organism, etc.) 
is likely to be mentioned in the query, before we inspect the 
query?" (After we see the query, our beliefs will change, 
e.g., depending on whether the query asks "who discovered
general relativity?" vs. "which physicist discovered general 
relativity?") So we need to design the prior distribution 
Pr(t|e). 

Recall that there may be hundreds of thousands of ts, and 
tens of millions of es, so fitting the prior for each e separately 
is out of the question. On the other hand, the prior is just a 
mild guidance mechanism to discourage obscure or low-recall 
types like "Austrian Physicists who died in 1972". Therefore, 
we propose the following crude but efficient estimate. From
a query log with ground truth (i.e., each query accompanied
with a t provided by a human), accumulate a hit count $N_t$
for each type t. At query time, given a candidate e, we
calculate

$$
\Pr(t \mid e) \propto
\begin{cases}
N_t + \gamma, & e \in^+ t \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$

where $\gamma \in (0, 1)$ is a tuned constant.

4.2 Query word switch variables 

Suppose the query is the word sequence $(w_j, j = 1, \ldots, |q|)$.
For each position j, we posit a binary switch variable
$z_j \in \{h, s\}$. Each $z_j$ will be generated iid from a Bernoulli
distribution with tuned parameter $\delta \in (0, 1)$. If $z_j = h$, then
word $w_j$ is intended as a hint to the target type. Otherwise
$w_j$ is a selector sampled from snippets mentioning entity e.
The vector of switch variables is called z.

The number of possible partitions of query words into
hints and selectors is $2^{|q|}$. By definition, telegraphic queries
are short, so $2^{|q|}$ is manageable. One can also reduce this
search space by asserting additional constraints, without 
compromising quality in practice. E.g., we can restrict the 
type hint to a contiguous span with at most three tokens. 

Given q and a proposed partition z, we define two helper 
functions, overloading symbols s and h: 

Hint words of q: $h(q, z) = \{w_{q,j} : z_j = h\}$ (2)
Selector words of q: $s(q, z) = \{w_{q,j} : z_j = s\}$. (3)

With these definitions, in the exhaustive hint-selector
partition case, z is the result of $|q|$ Bernoulli trials with hint
probability $\delta \in (0, 1)$ for each word, so we have

$$
\Pr(z) = \delta^{|h(q,z)|} \, (1 - \delta)^{|s(q,z)|}. \tag{4}
$$

$\delta$ is tuned using training data.

In this paper we will consider strict partitions of query 
words between hints and selectors, but it is not difficult to 
generalize to words that may be both hints and selectors. 
Assuming each query word has a purpose, the full space
grows to $3^{|q|}$, but assuming contiguity of the hint segment
again reduces the space to essentially $O(|q|)$.
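
The hint-selector search space discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: labelings are tuples over the symbols "h" and "s", the function names are hypothetical, and including the all-selector labeling in the contiguous case is our own choice for the sketch. The probability of a labeling follows the iid Bernoulli reading of Pr(z), i.e. delta raised to the number of hints times (1 - delta) raised to the number of selectors.

```python
from itertools import product


def all_partitions(num_words):
    """Yield every labeling z in {h, s}^|q| (2^|q| labelings)."""
    return product("hs", repeat=num_words)


def contiguous_partitions(num_words, max_hint_len=3):
    """Yield labelings whose hint words form one contiguous span of
    1..max_hint_len tokens. Including the all-selector labeling is an
    assumption of this sketch."""
    yield ("s",) * num_words
    for start in range(num_words):
        for length in range(1, max_hint_len + 1):
            end = start + length
            if end > num_words:
                break
            yield tuple("h" if start <= j < end else "s"
                        for j in range(num_words))


def prob_z(z, delta):
    """Pr(z) = delta^(#hints) * (1 - delta)^(#selectors)."""
    num_hints = sum(1 for zj in z if zj == "h")
    return delta ** num_hints * (1.0 - delta) ** (len(z) - num_hints)
```

For a four-word query, the exhaustive space has 16 labelings, while the contiguous restriction with at most three hint tokens leaves 10.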

4.3 Type description language model 

Globally across queries, the textual description of each 
type t induces a language model. We can define the ex- 
act form of the model in any number of ways, but, to keep 
implementations efficient, we will make the commonly used 
assumption that hint words are conditionally independent 
of each other given the type. Each type t is described by
one or more lemmas (descriptive phrases) $L(t)$, e.g.,
Austrian physicists. Because lemmas are very short, words are
rarely repeated, so we can use the multivariate Bernoulli [22]
distribution derived from lemma $\ell$:

$$
\Pr(w \mid \ell) =
\begin{cases}
1, & \text{if } w \text{ appears in } \ell \\
0, & \text{otherwise}
\end{cases}
\tag{5}
$$



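Following usual smoothing policies [39], we interpolate the
distribution above with a background language
model created from all types:

$$
\Pr(w \mid \mathcal{T}) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}}
\mathbf{1}\left[\, w \text{ appears in } \ell \text{ for some } \ell \in L(t) \,\right]
\tag{6}
$$

in words, the fraction of all types that contain w. We splice
together (5) and (6) using parameter $\beta \in (0, 1)$:

$$
\widetilde{\Pr}(w \mid \ell) = (1 - \beta) \Pr(w \mid \ell) + \beta \Pr(w \mid \mathcal{T}). \tag{7}
$$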



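The probability of generating exactly the hint words in the
query is

$$
\Pr(h(q,z) \mid \ell) = \prod_{w \in h(q,z)} \widetilde{\Pr}(w \mid \ell)
\prod_{w \notin h(q,z)} \left(1 - \widetilde{\Pr}(w \mid \ell)\right), \tag{8}
$$

where w ranges over the entire vocabulary of type
descriptions. In case of multiple lemmas describing a type,

$$
\Pr(\cdot \mid t) = \max_{\ell \in L(t)} \Pr(\cdot \mid \ell); \tag{9}
$$

i.e., use the most favorable lemma. All fitted parameters in
the distribution $\widetilde{\Pr}(w \mid \ell)$ are collectively called $\varphi$.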
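
A minimal Python sketch of the type description language model follows. It assumes that the unsmoothed lemma model assigns probability 1 to words in the lemma and 0 to all other words, that lemmas are whitespace-tokenized lowercase strings, and that the catalog is a dictionary from type name to its list of lemmas; the function names and this data layout are hypothetical, chosen only for illustration.

```python
def background_prob(word, type_lemmas):
    """Fraction of types having some lemma that contains `word`."""
    hits = sum(1 for lemmas in type_lemmas.values()
               if any(word in lemma.split() for lemma in lemmas))
    return hits / len(type_lemmas)


def smoothed_prob(word, lemma, type_lemmas, beta):
    """Interpolate the lemma's 0/1 Bernoulli with the background model."""
    in_lemma = 1.0 if word in lemma.split() else 0.0
    return (1.0 - beta) * in_lemma + beta * background_prob(word, type_lemmas)


def hint_prob_given_type(hint_words, type_name, type_lemmas, beta):
    """Multivariate Bernoulli probability of generating exactly the hint
    words, maximized over the lemmas of the type."""
    vocab = {w for lemmas in type_lemmas.values()
             for lemma in lemmas for w in lemma.split()}
    hints = set(hint_words)
    best = 0.0
    for lemma in type_lemmas[type_name]:
        p = 1.0
        for w in hints:            # words that must be generated
            p *= smoothed_prob(w, lemma, type_lemmas, beta)
        for w in vocab - hints:    # vocabulary words that must be absent
            p *= 1.0 - smoothed_prob(w, lemma, type_lemmas, beta)
        best = max(best, p)
    return best
```

Note that under these assumptions a hint word that appears in no type description receives probability zero, which is one reason the hint-selector split matters.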

4.4 Entity snippet language model 

The selector part of the query, $s(q,z)$, is generated from
a language model derived from $S_e$, the set of snippets that
mention candidate entity e. For simplicity we use the same
kind of smoothed multivariate Bernoulli distribution to build 
the language model as we did for the type descriptions. Note 
that words that appear in snippets but not in the query are 
of no concern in a language model that seeks to generate the 
query from distributions associated with the snippets.
Suppose corpusCount(e) is the number of mentions of e in the
corpus C, and corpusCount(e, w) is the number of mentions
of e where w also occurs within a specified snippet window
width. The unsmoothed probability of generating a query
word w from the snippets of e is

$$
\Pr(w \mid e) = \frac{\mathrm{corpusCount}(e, w)}{\mathrm{corpusCount}(e)}
= \frac{|\{s \in S_e : w \in s\}|}{\mathrm{corpusCount}(e)}. \tag{10}
$$

As before, we will smooth the above estimate using a corpus-level,
entity-independent background word distribution estimate:

$$
\Pr(w \mid C) = \frac{1}{|C|} \, (\text{number of documents containing } w). \tag{11}
$$

And now we use the interpolation

$$
\widetilde{\Pr}(w \mid e) = (1 - \alpha) \Pr(w \mid e) + \alpha \Pr(w \mid C), \tag{12}
$$

where $\alpha \in (0, 1)$ is a suitable smoothing parameter. The
fitted parameters of the $\widetilde{\Pr}(w \mid e)$ distribution are collectively
called $\theta$. Similar to (8), the selector part of the query is
generated with probability

$$
\Pr(s(q,z) \mid e) = \prod_{w \in s(q,z)} \widetilde{\Pr}(w \mid e)
\prod_{w \notin s(q,z)} \left(1 - \widetilde{\Pr}(w \mid e)\right), \tag{13}
$$

except here w ranges over all query words.
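
The snippet language model can be sketched similarly. The following Python fragment is an illustration only: it assumes each snippet is available as a set of words, that the background estimate is the fraction of corpus documents containing the word, and that a labeling z is a tuple over "h" and "s" aligned with the query words. The names `doc_freq`, `num_docs`, and the function names are hypothetical.

```python
def snippet_word_prob(word, snippets, doc_freq, num_docs, alpha):
    """Smoothed probability that a snippet mentioning e contains `word`."""
    with_word = sum(1 for s in snippets if word in s)
    unsmoothed = with_word / len(snippets)
    background = doc_freq.get(word, 0) / num_docs
    return (1.0 - alpha) * unsmoothed + alpha * background


def selector_prob_given_entity(query_words, z, snippets,
                               doc_freq, num_docs, alpha):
    """Multivariate Bernoulli over the query words: selectors must be
    generated from the snippets, the remaining query words must not."""
    p = 1.0
    for w, zj in zip(query_words, z):
        pw = snippet_word_prob(w, snippets, doc_freq, num_docs, alpha)
        p *= pw if zj == "s" else (1.0 - pw)
    return p
```

Because the product runs only over query words, words that occur in snippets but not in the query never enter the computation, as noted above.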

4.5 Putting the pieces together 

A plate diagram for the process generating a query q
is shown in Figure 4. Vertices are marked with random
variables E, T, Z, W whose instantiations are specific values
$e, t, z, w \in q$.

Figure 4: Plate diagram for generating a query q
from a candidate entity e. Only $(w_{q,j} : j = 1, \ldots, |q|)$
are observed variables. $\varphi$ represents the type
description language model and $\theta$ represents the entity
mention snippets language model. $(z_{q,j} : j = 1, \ldots, |q|)$
are the hidden switch variables. T is the hidden type
variable.

The hidden variables of interest are the binary $Z \in \{h, s\}$,
for selecting between type hint (h) and selector (s) words,
and T, the type of one query. Each query picks one hidden
value t, and a vector of size $|q|$ for Z, denoted z. The only
observed variables are the $|q|$ query words $(w_j : j = 1, \ldots, |q|)$.
Also, $\alpha, \beta, \gamma, \delta$ are hyper-parameters tuned globally across
queries.

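In the end we are interested in $\arg\max_e \Pr(e \mid q)$, where

\begin{align}
\Pr(e \mid q) \propto \Pr(e, q) &= \Pr(e) \Pr(q \mid e) = \Pr(e) \sum_{t,z} \Pr(q, t, z \mid e) \notag \\
&= \Pr(e) \sum_{t,z} \Pr(t \mid e) \Pr(z \mid e, t) \Pr(q \mid e, t, z) \tag{14} \\
&\approx \Pr(e) \sum_{t,z} \Pr(t \mid e) \Pr(z) \Pr(q \mid e, t, z) \tag{15} \\
&= \Pr(e) \sum_{t,z} \Pr(t \mid e) \Pr(z) \Pr(h(q,z) \mid t) \Pr(s(q,z) \mid e). \notag
\end{align}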

To get from (14) to (15), we make the simplifying assumption
that the density of hint words in queries is independent of
the candidate entity and type. As mentioned before, adding
over t, z is feasible for telegraphic queries because they are
short. The prior Pr(e) may be uninformative (i.e., uniform),
or set proportional to $|S_e|$ [21], or use shrunk estimates from
answer types in the past. We use $\Pr(e) = |S_e| / \sum_{e'} |S_{e'}|$.

If we allow a query word to represent both a type hint and 
a selector, the clean separation after (15) no longer works,



but it is possible to extend the framework using a soft-OR 
expression. We omit details owing to space constraints. 
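
To make the summation over types and hint-selector partitions concrete, here is a hedged Python sketch of the generative score. It assumes the component models are supplied as plain Python objects: `type_prior` is a dictionary from each type containing e to Pr(t|e), while `hint_prob(hints, t)` and `selector_prob(selectors)` are callables standing in for the type and snippet language models. All of these names are hypothetical. The function also tracks the single best (t, z) term, which can be shown to the user as an interpretation of the query.

```python
from itertools import product


def generative_entity_score(query_words, entity_prior, type_prior,
                            hint_prob, selector_prob, delta):
    """Return (unnormalized Pr(e|q), best (t, z)) by summing
    Pr(t|e) Pr(z) Pr(h(q,z)|t) Pr(s(q,z)|e) over all t and z."""
    total = 0.0
    best_term, best_tz = -1.0, None
    for t, p_t in type_prior.items():
        for z in product("hs", repeat=len(query_words)):
            hints = [w for w, zj in zip(query_words, z) if zj == "h"]
            selectors = [w for w, zj in zip(query_words, z) if zj == "s"]
            p_z = delta ** len(hints) * (1.0 - delta) ** len(selectors)
            term = p_t * p_z * hint_prob(hints, t) * selector_prob(selectors)
            total += term
            if term > best_term:
                best_term, best_tz = term, (t, z)
    return entity_prior * total, best_tz
```

The double loop costs on the order of the number of types containing e times 2^|q| evaluations, which stays small only because telegraphic queries are short.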

4.6 Explaining a top-ranking entity 

In standard text search, top-ranking URLs are accompa- 
nied by a summary with matching query words highlighted. 
In our system, top-ranking entities need to be justified by 
explaining to the user how the query was interpreted. Specif- 
ically, we need to show the user the inferred type, and the 
inferred purpose (hint or selector) of each query word. 

\begin{align}
\Pr(t, z \mid e, q) &\propto \Pr(e, t, q, z) \notag \\
&= \Pr(e) \Pr(t \mid e) \Pr(z \mid e, t) \Pr(q \mid e, t, z) \notag \\
&\approx \Pr(e) \Pr(t \mid e) \Pr(z) \Pr(q \mid e, t, z), \tag{16}
\end{align}

approximating $\Pr(z \mid e, t) \approx \Pr(z)$ as before. Now we can
report $\arg\max_{t,z} \Pr(t, z \mid e, q)$ as the explanation for e. It is also
possible to report marginals such as $\Pr(t \mid e, q)$ or $\Pr(z_j \mid e, q)$
this way.

4.7 Potential pitfalls

As often happens, a generative formulation starts out feeling
natural, but is soon mired in a number of questionable
assumptions and tuned hyper parameters. In recent times,
this story has played out in many problems, such as
information extraction [31] and learning to rank [20], where
generative language models were proposed earlier, but the latest
algorithms are all discriminatively trained. The above
formulation has several potential shortcomings:

• The modeling of $\Pr(t \mid e)$ is necessarily a compromise.

• $\Pr(z_j)$ is assumed to be independent of q and e, and
iid. These assumptions may not be the best.

• In the interest of computational feasibility, the
language models for both types and snippets are simplistic.
Phrase and exact matches are difficult to capture.

• Hyper parameters $\alpha, \beta, \gamma, \delta$ can only be tuned by
sweeping ranges; no effective learning technique is obvious.

• As often happens with complex generative models, the
scales of probabilities being multiplied in (15) are diverse
and hard to balance.

5. DISCRIMINATIVE FORMULATION

Instead of designing conditional distributions as in
Section 4, here we will design feature functions, and learn weights
corresponding to them by using relevant and (samples of)
irrelevant entity sets $\mathcal{E}_q^+, \mathcal{E}_q^-$ associated with each query q, as
is standard in learning to rank [20]. The benefit is that it
is much safer to incrementally add highly informative but
strongly correlated features (such as exact phrase match,
match with and without stemming, etc.) to discriminative
formulations.

Standard notation used in structured max-margin learning
uses $\phi(x, y) \in \mathbb{R}^d$ as the feature map, where x is an
observation and y is the label to be predicted. A model $\lambda \in \mathbb{R}^d$
is fitted so that $\lambda \cdot \phi(x, y_{\text{correct}}) > \lambda \cdot \phi(x, y_{\text{incorrect}})$. Once
$\lambda$ gets fixed via training, given a new test instance $x_{\text{test}}$,
inference is the process of finding $\arg\max_y \lambda \cdot \phi(x_{\text{test}}, y)$.

In our case, we use the notation $\phi(q, e, t, z)$ for the feature
map. Here q gives us access to the sequence of words in the query,
and is the analog of x above; e gives us access to the snippets
$S_e$ that support e, and is the analog of y above; t and z are
latent variable [38] inputs to the feature map whose role will
be explained shortly.

Guided by the generative formulation in Section 4, we
partition the feature vector as follows:

$$
\phi(q, e, t, z) = \left(\phi_1(q, e), \phi_2(t, e), \phi_3(q, z, t), \phi_4(q, z, e)\right), \tag{17}
$$

where

• $\phi_1(q, e)$ models the prior for e.

• $\phi_2(t, e)$ models the prior $\Pr(t \mid e)$.

• $\phi_3(q, z, t)$ models the compatibility between the type
hint part of query words and the proposed type t.

• $\phi_4(q, z, e)$ models the compatibility between the
selector part of query words and $S_e$.

5.1 Features $\phi_1$ modeling entity prior

In Section 4.5 we used $\Pr(e) = |S_e| / \sum_{e'} |S_{e'}|$ as a prior
probability for e. It is natural to make this one element
in $\phi_1$. But the discriminative setup allows us to introduce
other powerful features.
$|S_e|$ does not distinguish between snippets that match the
query well vs. poorly. Let IDF(w) be the inverse document
frequency [2] of query word w, and $\mathrm{IDF}(q) = \sum_{w \in q} \mathrm{IDF}(w)$.
$c \cap q$ is the set of query words found in snippet c, with
total $\mathrm{IDF}(c \cap q) = \sum_{w \in c \cap q} \mathrm{IDF}(w)$. Then the
match-quality-weighted snippet support for e is characterized as

$$
\phi_1(q, e)[\cdot] = \frac{1}{2^{|q|} \, \mathrm{IDF}(q)} \sum_{c \in S_e} \mathrm{IDF}(c \cap q), \tag{18}
$$

where $2^{|q|} \, \mathrm{IDF}(q)$ normalizes the feature across diverse queries.
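
As an illustration, the match-quality-weighted snippet support can be computed as in the Python sketch below. It assumes snippets are given as sets of words and that IDF values are supplied in a precomputed dictionary `idf`; the function name and data layout are hypothetical. The sum runs over all snippets supporting e, and the normalizer is 2^|q| times IDF(q), following the description above.

```python
def weighted_snippet_support(query_words, snippets, idf):
    """Sum of IDF(c & q) over the snippets c of an entity,
    divided by 2^|q| * IDF(q)."""
    q = set(query_words)
    idf_q = sum(idf[w] for w in q)
    matched = sum(sum(idf[w] for w in (c & q)) for c in snippets)
    return matched / (2 ** len(query_words) * idf_q)
```

A snippet that contains no query word contributes nothing, so an entity with many poorly matching snippets is no longer favored the way it would be by the raw count of snippets.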

Another feature in $\phi_1$ relates to negative evidence. If
there are other words present, a query that directly
mentions an entity is hardly ever answered correctly by that
entity; Tom_Cruise could not be the answer for the query
tom cruise wife. Another (0/1) element in $\phi_1$ is whether
a description ("lemma") of e is contained in the query. In
our experiments, the model element in $\lambda$ corresponding to
this feature turns out to be a negative number, as expected.

5.2 Features $\phi_2$ modeling type prior

We have already proposed one way to estimate $\Pr(t \mid e)$ in
Section 4.1. This estimate is a natural element in $\phi_2$. We can
also help the learner use the generality or specificity of types,
measured as this feature: $|\{e : e \in^+ t\}| / |\mathcal{E}|$. In our
experiments, the element of $\lambda$ corresponding to this feature also
got negative values, indicating preference of specific types
over generic ones. This corroborates earlier observation re- 
garding the depth of desired types in a hierarchy p\. 

5.3 Hint-type compatibility features $\phi_3$

Given the input parameters of $\phi_3(q, z, t)$, we compute the
hint word subsequence $h(q, z)$ as in (2). Now we can define
any number of features between these hint words and the
given type t, which has lemma set $L(t)$.

• A standard feature borrowed from (9) is $\Pr(h(q, z) \mid t)$.

• Unlike in the generative formulation, we can add syn- 
thetic features. E.g., a feature that has value 1 if t 
matches the subsequence h(q, z) exactly. 

• In Section 4, the size of $h(q, z)$ was drawn from a
binomial distribution controlled by hyper parameter $\delta$.
To model more general distributions, we use binary
features of the form

$$
\begin{cases}
1, & |h(q, z)| < k \\
0, & \text{otherwise}
\end{cases}
$$

for $k = 1, \ldots$, to capture the belief that a smaller number
of hint words is preferable.

5.4 Selector-snippets compatibility features $\phi_4$

Now consider q and its selectors $s(q, z) \subseteq q$ as word sets
(no duplicates), and the snippets $S_e$ supporting candidate
entity e. $\phi_4(q, z, e)$ will include features that express the
extent of match or compatibility between the selector words 
and the snippets. We need to characterize and then combine 
two kinds of signals here: 

• The rarity (hence, informativeness) of a subset of s(q, z) 
that match in snippets, and 

• The number of supporting snippets [51] that match a 
given word set. 

(A third kind of signal, proximity [26, 35, 34], is favored
indirectly, because snippets have limited width. A more refined
treatment of proximity is left for future work.) 



A snippet $c \in S_e$, interpreted as a subset of query words $q$,
covers $s(q, z)$ if $c \supseteq s(q, z)$. Otherwise
$c \not\supseteq s(q, z)$. Recall every snippet $c$ has an
$\mathrm{IDF}(c) = \sum_{w \in c \cap q} \mathrm{IDF}(w)$. We propose
two features:

$$\frac{1}{|S_e|\,\mathrm{IDF}(q)} \sum_{c \supseteq s(q,z)} \mathrm{IDF}(s(q,z)) = \frac{\mathrm{IDF}(s(q,z))\,|\{c : c \supseteq s(q,z)\}|}{|S_e|\,\mathrm{IDF}(q)} \qquad (19)$$

and

$$\frac{1}{|S_e|\,\mathrm{IDF}(q)} \sum_{c \not\supseteq s(q,z)} \mathrm{IDF}(c) \qquad (20)$$


We found the separation above to be superior to collapsing 
covering and non-covering snippets into one sum. Another 
useful feature was the fraction of snippets c such that c = q 
(exactly matching all query words). 
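To make the covering and non-covering signals concrete, here is a small sketch of how the two features in (19) and (20) might be computed for one candidate entity. The function name, the toy IDF values, and in particular the normalizer $|S_e| \cdot \mathrm{IDF}(q)$ are assumptions; the normalizer is our reading of the formulas and may differ from the authors' exact definition.

```python
def snippet_features(query, selectors, snippets, idf):
    """Return (covering feature, non-covering feature) for one candidate entity."""
    q, s = set(query), set(selectors)

    def total_idf(words):
        return sum(idf[w] for w in words)

    norm = len(snippets) * total_idf(q)  # assumed normalizer: |S_e| * IDF(q)
    if norm == 0:
        return 0.0, 0.0
    covering = 0
    noncovering_idf = 0.0
    for snippet in snippets:
        c = set(snippet) & q  # view the snippet as a subset of query words
        if c >= s:            # snippet covers every selector word
            covering += 1
        else:                 # partial match: credit the IDF it does match
            noncovering_idf += total_idf(c)
    return total_idf(s) * covering / norm, noncovering_idf / norm


# Hypothetical IDF values and snippets, for illustration only.
idf = {"first": 1.0, "woman": 2.0, "space": 3.0}
query = ["first", "woman", "space"]
selectors = ["woman", "space"]
snippets = [
    ["woman", "in", "space"],
    ["first", "space", "flight"],
    ["first", "woman", "space"],
    ["woman"],
]
f_cover, f_partial = snippet_features(query, selectors, snippets, idf)
print(f_cover, f_partial)
```

Keeping the two sums as separate features lets the learner weight full coverage differently from partial matches, in line with the observation that the separation worked better than a single collapsed sum.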

5.5 Inference and training 

With a wrong choice of hint-selector partition z, or a 
wrong choice of type t, even a highly relevant response e 
could score very poorly. Therefore, any reasonable scoring 
scheme should evaluate e under the best choice of t,z. I.e., 
the score of e should be 



$$\max_{t : e \in^+ t,\; z} \lambda \cdot \phi(q, e, t, z) \qquad (21)$$



(Note that t ranges over only those types to which e belongs.) 
In learning to rank [20] , three training paradigms are com- 
monly used: itemwise, pairwise and listwise. Because of the 
added complexity from the latent variables t, z, here we dis- 
cuss itemwise and pairwise training. Listwise training is left 
for future work. 
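The scoring rule (21) amounts to an exhaustive search over the types of $e$ and the candidate partitions. A minimal sketch follows, assuming a caller-supplied feature function `phi` and small enumerable sets of types and partitions; the names `score_entity` and `toy_phi`, the partition labels, and all numbers are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score_entity(lam, phi, q, e, types_of_e, partitions):
    """Return (best score, best type, best partition) for entity e."""
    best = None
    for t in types_of_e:          # only types that e belongs to
        for z in partitions:      # candidate hint/selector partitions
            value = dot(lam, phi(q, e, t, z))
            if best is None or value > best[0]:
                best = (value, t, z)
    return best


# Toy feature table: [type generality, hint-type match, snippet match].
table = {
    ("person", "z1"): [0.8, 0.2, 0.5],
    ("person", "z2"): [0.8, 0.0, 0.9],
    ("cosmonaut", "z1"): [0.01, 0.9, 0.6],
    ("cosmonaut", "z2"): [0.01, 0.0, 0.9],
}


def toy_phi(q, e, t, z):
    return table[(t, z)]


lam = [-0.5, 1.0, 1.0]
best_score, best_type, best_partition = score_entity(
    lam, toy_phi, "cosmonaut first woman", "Valentina Tereshkova",
    ["person", "cosmonaut"], ["z1", "z2"],
)
print(best_score, best_type, best_partition)
```

The maximizing pair doubles as an explanation of why the entity was ranked where it was, since it names the type and partition that produced the score.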

In itemwise training, each response entity e is one item, 
which can be good (relevant, denoted e+) or bad (irrelevant, 
denoted e_). Following standard max-margin methodology, 
we want 

$$\forall q, e_+ : \max_{t,z} \lambda \cdot \phi(q, e_+, t, z) \ge 1 - \xi_{q,e_+}, \quad \text{and} \qquad (22)$$

$$\forall q, e_- : \max_{t,z} \lambda \cdot \phi(q, e_-, t, z) \le 1 + \xi_{q,e_-}, \qquad (23)$$

where $\xi_{q,e_+}, \xi_{q,e_-} \ge 0$ are the usual SVM-style slack
variables. Constraint (23) is easy to handle by breaking it up
into the conjunction:

$$\forall q, e_-, \forall t, \forall z : \lambda \cdot \phi(q, e_-, t, z) \le 1 + \xi_{q,e_-} \qquad (24)$$



However, (22) is a disjunctive constraint, as also arises in
multiple instance classification or ranking [7]. A common
way of dealing with this is to modify constraint (22) into

$$\forall q, e_+ : \sum_{t,z} u(q, e_+, t, z)\, \lambda \cdot \phi(q, e_+, t, z) \ge 1 - \xi_{q,e_+} \qquad (25)$$

where $u(q, e, t, z) \in \{0, 1\}$ and

$$\forall q, e_+ : \sum_{t,z} u(q, e_+, t, z) = 1.$$

This is an integer program, so the next step is to relax the
new variables to $0 \le u(q, e, t, z) \le 1$ (i.e., the
$(t, z)$-simplex). Unfortunately, owing to the introduction of new
variables $u(\cdots)$ and multiplication with old variables $\lambda$,
the optimization is no longer convex.

Bergeron et al. [7] propose an alternating optimization:
holding one of $u$ and $\lambda$ fixed, optimize the other, and repeat
(there are no theoretical guarantees). Note that if $\lambda$ is fixed,
the optimization of $u$ is a simple linear program. If $u$ is fixed,
the optimization of $\lambda$ is comparable to training a standard
SVM. The objective would then take the form



^n J + i|£- +e; ~" 



\£J\ + \£n 



(26) 



Here $C > 0$ is the usual SVM parameter trading off training
loss against model complexity. Note that $u$ does not appear
in the objective.

In our application, $\phi(q, e, t, z) \ge 0$. Suppose $\lambda \ge 0$
in some iteration (which easily happens in our application). In that
case, to satisfy constraint (25), it suffices to set only one
element in $u$ to 1, corresponding to
$\arg\max_{t,z} \lambda \cdot \phi(q, e, t, z)$, and the rest to 0s.
This severely restricts the search space over $u, \lambda$ in
subsequent iterations.

To mitigate this problem, we propose the following annealing
protocol. The collapse of the $u$ distribution reduces entropy
suddenly. The remedy is to subtract from the objective (to
be minimized) a term related to the entropy of the $u$ distribution:

$$(26) + D \sum_{q, e_+} \sum_{t,z} u(q, e_+, t, z) \log u(q, e_+, t, z). \qquad (27)$$

Here $D > 0$ is a temperature parameter that is gradually
reduced in powers of 10 toward zero with the alternating
iterations optimizing $u$ and $\lambda$. Note that the objective (27)
is convex in $u$, $\lambda$ and $\xi$. Moreover, with either $u$ or
$\lambda$ fixed, all constraints are linear inequalities.



initialize u to random values on the simplex
initialize D to some positive value
while not reached local optimum do
    fix u and solve quadratic program to get next λ
    reduce D geometrically
    fix λ and solve convex program for next u



Figure 5: Pseudocode for discriminative training. 
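The effect of the temperature $D$ can be seen in a deliberately simplified setting. The sketch below is not the constrained program of Figure 5: it swaps in a penalty form for a single $(q, e_+)$ pair with $\lambda$ held fixed, minimizing $-\sum u\,s + D \sum u \log u$ over the simplex, whose minimizer is the softmax of $s / D$. The scores and function names are hypothetical.

```python
import math


def annealed_u(scores, D):
    """Softmax of scores / D: minimizes -sum(u*s) + D*sum(u*log u) on the simplex."""
    top = max(scores)
    weights = [math.exp((s - top) / D) for s in scores]  # shift for stability
    total = sum(weights)
    return [w / total for w in weights]


def entropy(u):
    return -sum(x * math.log(x) for x in u if x > 0.0)


# Hypothetical scores lambda . phi(q, e+, t, z) for three (t, z) choices.
scores = [0.9, 1.0, 0.2]
schedule = []
for power in range(1, -4, -1):       # D = 10, 1, 0.1, 0.01, 0.001
    D = 10.0 ** power
    u = annealed_u(scores, D)
    schedule.append((D, u, entropy(u)))
    print(f"D={D:g}  u={[round(x, 3) for x in u]}  entropy={entropy(u):.4f}")
```

At high temperature $u$ stays close to uniform, so near-tied $(t, z)$ choices keep influencing $\lambda$; as $D$ shrinks, $u$ concentrates on the arg max, which is the collapsed solution described above.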

Very little changes if we extend from itemwise to pairwise 
training, except the optimization gets slower, because of the 
sheer number of pair constraints of the form: 



$$\forall q, e_+, e_- : \max_{t,z} \lambda \cdot \phi(q, e_+, t, z) - \max_{t,z} \lambda \cdot \phi(q, e_-, t, z) \ge 1 - \xi_{q,e_+,e_-} \qquad (28)$$

The itemwise objective in (26) changes to the pairwise
objective



U X f + f^T \o+i it- 1 ^> e +- 



(29) 



u-like variables can be used to convert this to an alternating 
optimization as before; details are omitted. 

5.6 Implementation details 

5.6.1 Reducing computational requirements 

The space of (q, e, t, z) and especially their discriminative 
constraints can become prohibitively large. To keep RAM 
and CPU needs practical, we used the following policies; our 
experimental results are insensitive to them. 

• We sampled down bad (irrelevant) entities $e_-$ that
were allowed to generate constraint (28).

• For empty $h(q, z) = \emptyset$, $\phi_3(q, z, t)$ provides no
signal. In such cases, we allow $t$ to take only one value: the
most generic type Entity.

5.6.2 Explaining a top-ranking entity 

This is even simpler in the discriminative setting than
in the generative setting; we can simply use (21) to report
$\arg\max_{t,z} \lambda \cdot \phi(q, e, t, z)$.

5.6.3 Implementing a target type predictor 

Extending the above scheme, each entity $e$ scores each
candidate type $t$ as
$\mathrm{score}(t|e) = \max_z \lambda \cdot \phi(q, e, t, z)$. This
induces a ranking over types for each entity. We can choose
the overall type predicted for the query as the one whose sum
of ranks among the top-$k$ entities is smallest. An apparently
crude approximation would be to predict the best type for
the single top-ranked entity. But $k > 1$ can stabilize the
predicted type, in case the top entity is incorrect. (We may
want to predict a single type as feedback to the user, or
to compare with other type prediction systems, but, as we
shall see, not for the best quality of entity ranking, which is
best done collectively.)
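The rank-sum rule described above can be sketched as follows. The input layout (a best-first list of per-entity score dictionaries), the function name `predict_type`, the penalty rank given to a type an entity does not score, and the alphabetical tie-break are assumptions of this example, not details taken from the paper.

```python
def predict_type(type_scores_per_entity, k):
    """Pick the type with the smallest sum of ranks over the top-k entities.

    type_scores_per_entity: list ordered from best entity to worst; each item
    maps a type name to score(t|e) for that entity.
    """
    top = type_scores_per_entity[:k]
    all_types = set().union(*[scores.keys() for scores in top])
    missing_rank = len(all_types) + 1  # assumed penalty for unscored types
    rank_sum = {t: 0 for t in all_types}
    for scores in top:
        ordered = sorted(scores, key=scores.get, reverse=True)
        rank = {t: i + 1 for i, t in enumerate(ordered)}
        for t in all_types:
            rank_sum[t] += rank.get(t, missing_rank)
    # Sort names first so that ties are broken deterministically.
    return min(sorted(rank_sum), key=rank_sum.get)


# Hypothetical scores: the top-ranked entity is wrong and favors "river".
entities = [
    {"river": 2.0, "place": 1.0},
    {"city": 1.9, "settlement": 1.5, "place": 1.0},
    {"city": 1.8, "settlement": 1.4, "place": 0.9},
]
print(predict_type(entities, k=1))
print(predict_type(entities, k=3))
```

With `k=1` the prediction follows the single top entity; with `k=3` the two lower-ranked entities outvote it, which is the stabilizing effect mentioned in the text.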

6. EXPERIMENTS 

6.1 Testbed 

6.1.1 Catalog and annotated corpus 

Our type and entity catalog was YAGO [33], with about 
200,000 types and 1.9 million entities. An annotator trained 
on mentions of these entities in Wikipedia was run over a 
Web corpus from a commercial search engine, having 500 
million spam-free Web pages. This resulted in about 8 billion 
entity annotations, average 16 annotations per page. These 
were then indexed [12].

6.1.2 Type-constrained entity search

The index supports semistructured queries of the following 
form: 

• an answer type t from among the 200,000 YAGO types, 

• a bag of words and phrases in an IDF-WAND (weak-AND)
operator [11], and

• a snippet window width. 

A DAAT [11] query processor returns a stream of snippets, at
most as wide as the given window width limit, that contain a
mention of some entity $e \in^+ t$ and satisfy the WAND predicate.
In case of phrases in the query, the WAND threshold
is computed by adding the IDF of constituent words.

Our query processor is implemented using MG4J [9] in 
Java, with no index caching. Basic keyword WAND queries 
take a few seconds over 500 million documents. Setting 
$t$ = Entity, the root type, and asking for a stream of all
entities in qualifying snippets, slows down the query by a 
small factor. This is all that our algorithm needs from the 
corpus; we did not focus on query time because standard 
caching techniques and tighter code can improve it trivially. 

6.2 Queries with ground truth 

We use 709 entity search queries collected from many years
of TREC and INEX competitions, along with relevant and
irrelevant entities. Two paid master's students, familiar with
Web search engines, read the full TREC/INEX description
of entity search queries and wrote out queries they would
naturally issue to a commercial search engine. They also
selected the best (as per their judgment) type from YAGO for
each query, as ground truth. This data is publicly available
at bit.ly/WSpxvr. Launching the queries with the known
types resulted in 380 million snippets supporting candidate
entities; these are also available on request. We also
performed type prediction (Section 5.6.3) on the dataset provided
in [5]. Since this dataset does not contain ground truth of
relevant entities for each query, we did not test entity
ranking.

6.3 Generic and "perfect" baselines 

The ranking accuracy of a reasonable query interpreter 
algorithm in our framework will lie between two baselines: 
Generic: The generic baseline assumes zero knowledge of 
query types, instead using t = Entity, the root/s of 
the type hierarchy in the catalog. 
"Perfect": The "perfect" baseline assumes complete (human- 
provided) knowledge of the type and uses it in the 
semistructured query launched over the catalog and 
annotated corpus. 
Of course, even "perfect" may perform poorly in some queries, 
because of lack of support for relevant entities in the corpus, 
snippets incorrectly or not annotated (both false positive 
and negative), or incorrect absence of paths between types 
and entities in the catalog. It is also possible for an al- 
gorithm (including ours) to perform worse than generic on 
some queries, by choosing a particularly unfortunate type, 
but obviously it should do better than generic on average, 
to be useful. 

6.4 Measurements and results 

As is standard in entity ranking research, we report NDCG 
at various ranks, mean reciprocal rank (MRR, not trun- 
cated) and mean average precision (MAP) at the entity (not 
document) level. Space constraints prevent us from defin- 
ing these; see Liu [20] for details. For Discriminative, C 
is tuned by 5-fold cross validation at the query level. For 
Generative, we swept over $\alpha, \beta, \gamma, \delta$ in powers of
10 (e.g., $10^{-5}, 10^{-4}, \ldots, 1$).
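For readers who want the measures spelled out, the sketch below gives common textbook forms of average precision, reciprocal rank, and NDCG at a cutoff for one query with binary relevance. These are assumed definitions (in particular the $1/\log_2(i+1)$ discount); the exact variants used in the experiments are those of Liu [20]. MAP and MRR are the means of the first two over all queries.

```python
import math


def average_precision(ranked, relevant):
    """Mean of precision@i over the ranks i at which a relevant entity appears."""
    hits, total = 0, 0.0
    for i, entity in enumerate(ranked, start=1):
        if entity in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant entity, or 0 if none is retrieved."""
    for i, entity in enumerate(ranked, start=1):
        if entity in relevant:
            return 1.0 / i
    return 0.0


def ndcg_at(ranked, relevant, k):
    """Binary-gain DCG at rank k, normalized by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, entity in enumerate(ranked[:k], start=1)
        if entity in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["a", "b", "c", "d"]   # hypothetical system output for one query
relevant = {"b", "d"}           # hypothetical ground truth
print(average_precision(ranked, relevant))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at(ranked, relevant, 4))
```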
        Generic   Generative   Discriminative   Perfect
MAP     0.323↓    0.414↓       0.462            0.644
MRR     0.332↓    0.432↓       0.481            0.664

[Plot: x-axis Rank, 1 to 10]

Figure 6: Generic, generative, discriminative and
"perfect" accuracies.

6.4.1 Our algorithms vs. generic and perfect 

[Plot: x-axis Queries]

Figure 7: MAP of discriminative minus MAP of
generic, compared query-wise between generic and
discriminative. Below zero means discriminative did
worse than generic on that query. Queries in (arbitrary)
order of discriminative AP gain.

For our techniques to be useful, they must bridge a substantial
part of the gap between the generic lower bound and
the perfect upper bound. Figure 6 confirms that generative
bridges 28% of the MAP gap between generic and perfect,
whereas discriminative is significantly better at 43%. MRR
and NDCG follow similar trends. All gaps are statistically
significant at 95% confidence level (indicated by ↓).
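The quoted gap fractions can be checked with a few lines of arithmetic. The sketch assumes the four MAP and MRR values of Figure 6 are read in the order generic, generative, discriminative, perfect; that column order is an inference from the caption and from the percentages in the text. The helper name `gap_bridged` is hypothetical.

```python
def gap_bridged(system, generic, perfect):
    """Fraction of the generic-to-perfect gap that a system closes."""
    return (system - generic) / (perfect - generic)


# MAP and MRR values as read from Figure 6 (assumed column order).
map_scores = {"generic": 0.323, "generative": 0.414,
              "discriminative": 0.462, "perfect": 0.644}
mrr_scores = {"generic": 0.332, "generative": 0.432,
              "discriminative": 0.481, "perfect": 0.664}

for name, scores in (("MAP", map_scores), ("MRR", mrr_scores)):
    for system in ("generative", "discriminative"):
        frac = gap_bridged(scores[system], scores["generic"], scores["perfect"])
        print(f"{name} {system}: {100 * frac:.1f}% of the gap")
```

For MAP this gives roughly 28% and 43%, matching the text; the same computation on MRR gives roughly 30% and 45%.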

Figure 6 is aggregated over all queries. Figure 7 focuses
on average precision disaggregated into queries, comparing
discriminative against generic. While some queries are
damaged by discriminative, many more are improved.

Failure analysis revealed residual $(t, z)$ ambiguity, coupled
with lack of $\in^+$ or $\subseteq^+$ paths in an incomplete catalog,
to be the major reasons for losses on some queries. Even though
there is some ground yet to cover to reach "perfect" levels,
these results show there is much hope for automatically
interpreting even telegraphic queries.

6.4.2 Benefits of annealing optimization 

Figure 8 shows that discriminative with our entropy-based
annealing protocol performs significantly (marked with "↓")
better than the scheme proposed by Bergeron et al. [7]. This
may be of independent interest in multiple instance ranking
and max-margin learning with latent variables.

        Bergeron (26)   Entropy (27)
MAP     0.416↓          0.462
MRR     0.432↓          0.481

Figure 8: Benefits of annealing protocol.

6.4.3 Benefits of joint inference

A central premise of our work is that joint inference is better
than a two-stage process (predict type, launch query). To
test the essence of this hypothesis, we run our system, throw
away the ranked entity list, and only retain the predicted
type (Section 5.6.3), then launch a query restricted to this
type (Section 6.1.2) and measure entity ranking accuracy.

Figure 9: Joint inference improves entity ranking
quality compared to 2-stage.

Figure 9 shows that the result is significantly (shown by ↓)
less accurate than via joint inference, even after tuning
$k$, which indicates that no single inferred type may retain
enough information for the best entity ranking.

6.4.4 Comparison with B&N's type prediction 

B&N [5] proposed two models, of which the "entity-centric"
model was generally superior. Each entity $e$ was associated
with a textual description (e.g., Wikipedia page) which
induced a smoothed language model $\theta_e$. B&N estimate the
score of type $t$ as

$$\Pr(q|t) = \sum_{e \in^+ t} \Pr(q|\theta_e) \Pr(e|t), \qquad (30)$$

where $\Pr(e|t)$ was set to uniform. Note that no corpus (apart
from the one of entity descriptions) was used. The output
of B&N's algorithm (hereafter, "B&N") is a ranked list of
types, not entities. We implemented B&N, and obtained
accuracy closely matching their published numbers, using the
DBpedia catalog with 358 types, and 258 queries (different
from our main query set and testbed).
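A toy version of the entity-centric score in (30) is sketched below. It is not B&N's implementation: the Jelinek-Mercer smoothing with mixing weight `mix`, the function names, and the three-entity catalog are assumptions made so that the example is self-contained. $\Pr(e|t)$ is uniform over the entities of each type, as stated in the text.

```python
from collections import Counter


def query_likelihood(query, doc, collection, mix=0.5):
    """Pr(q | theta_e) under a unigram model smoothed with the collection."""
    doc_counts, coll_counts = Counter(doc), Counter(collection)
    p = 1.0
    for w in query:
        p_doc = doc_counts[w] / len(doc)
        p_coll = coll_counts[w] / len(collection)
        p *= (1 - mix) * p_doc + mix * p_coll
    return p


def type_scores(query, descriptions, members):
    """Pr(q | t) = sum over entities e of t of Pr(q | theta_e) * Pr(e | t)."""
    collection = [w for doc in descriptions.values() for w in doc]
    scores = {}
    for t, entities in members.items():
        likelihoods = [
            query_likelihood(query, descriptions[e], collection) for e in entities
        ]
        scores[t] = sum(likelihoods) / len(entities)  # uniform Pr(e | t)
    return scores


# Hypothetical entity descriptions and type memberships.
descriptions = {
    "Ganges": "river in india flowing east".split(),
    "Nile": "longest river in africa".split(),
    "Cairo": "capital city of egypt in africa".split(),
}
members = {"river": ["Ganges", "Nile"], "city": ["Cairo"]}
scores = type_scores(["river", "africa"], descriptions, members)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Note that only the entity descriptions enter the score; no snippet or corpus evidence is used, which is the contrast drawn with the joint formulation.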

We turned our system into a type predictor (Section 5.6.3),
and also used DBpedia like B&N and compared type prediction
accuracy on the dataset provided in [5]. Results are shown
in Figure 10. At $k = 1$, our discriminative type prediction
matches B&N, and larger $k$ performs better, owing to
stabilizing consensus from lower-ranked entities. Coupled with
the results in Section 6.4.3, this is strong evidence that our
unified formulation is superior, even if the goal is type
prediction.

Figure 10: Type prediction by B&N vs. discriminative.

6.4.5 Comparison with B&N-based entity ranking 

A type prediction may be less than ideal, and yet entity
prediction may be fine. One can take the top type predicted
by B&N, and launch a query (see Section 6.1.2) with that
type restriction. To improve recall, we can also take the
union of the top $k$ predicted types. The search results in
a ranked list of entities, on which we can compute
entity-level MAP, MRR, NDCG, as usual. In this setting, both
B&N and our algorithm (discriminative) used YAGO as the
catalog. Results for our dataset (Section 6.2) are shown in
Figure 11.

Figure 11: B&N-driven entity ranking accuracy.

We were surprised to see the low entity ranking accuracy
(which is why we recreated very closely their reported
type ranking accuracy on DBpedia). Closer scrutiny revealed
that the main reason for lower accuracy was changing
the type catalog from DBpedia (358 types) to YAGO
(over 200,000 types). Entity ranking accuracy is low because
B&N's type prediction accuracy is very low on YAGO: 0.04
MRR, 0.04 MAP, and 0.058 NDCG@10. For comparison,
our type prediction accuracy is 0.348 MRR, 0.348 MAP, and
0.475 NDCG@10. This is entirely because of corpus/snippet
signal: if we switch off snippet-based features $\phi_4$, our
accuracy also plummets. The moral seems to be that large organic
type catalogs provide enough partial and spurious matches
for any choice of hints, so it is essential (and rewarding) to
exploit corpus signals.




[Plot: x-axis Query]

Figure 12: 2-stage entity ranking via B&N does
boost accuracy for some queries, but the overall effect
is negative. Joint interpretation and ranking
also damages some queries but improves many more.

On average, B&N type prediction, followed by query
launch, seems worse than generic. This is almost entirely
because of choosing bad types for many, but not all, queries.
There are queries where B&N shows a (e.g., MAP) lift beyond
generic, but they are just too few (Figure 12).



k       MAP     MRR
1       0.135   0.145
5       0.240   0.250
10      0.295   0.305
Discr   0.422   0.437

[Plot: x-axis Rank, 1 to 10]

Figure 13: Entity ranking accuracy using DBpedia
types.

6.4.6 Coarse DBpedia types with Web corpus 
A plausible counter-argument to the above experiments is
that, by moving from only 358 DBpedia types to over 200,000
YAGO types, we are making the type prediction problem
hopelessly difficult for B&N, and that this level of type
refinement is unnecessary for high accuracy in entity search.
We modified our system to use types from DBpedia, and
correspondingly re-indexed our Web corpus annotations using
DBpedia types. As partial confirmation of the above
hypothesis, the entity ranking accuracy using B&N did
increase substantially. However, as shown in Figure 13, the
entity ranking accuracy achieved by our discriminative
algorithm remains unbeaten. Also compare with Figure [B]:
whereas B&N improves by coarsening the type system, our
discriminative algorithm seems to be degraded by this move.

6.4.7 DEANNA on telegraphic queries

We also tried to use the Web interface to send a sample of
our telegraphic queries and their well-formed sentence
counterparts to DEANNA [37] and receive back the interpretation.
We manually inspected their output. Some anecdotes
are shown in Figure 14. The queries are from Figure 3. None
of the telegraphic queries was successfully interpreted. The
well-formed questions saw partial success.

7. CONCLUSION 

We initiated a study of generative and discriminative
formulations for joint query interpretation and response ranking,
in the context of targeted-type entity search needs expressed
in a natural "telegraphic" Web query style. Using
380 million snippets from a Web-scale corpus with 500 million
documents annotated at 8 billion places with over 1.5
million entities and 200,000 types from YAGO, we showed
experimentally that jointly interpreting target type and ranking
responses is superior to a two-phase interpret-then-execute
paradigm.

Our work opens up several directions for further research.
Our notion of selectors can be readily generalized to allow
mentions of entities as literals [14, 25] in the query. More
sophisticated training using bundle methods may further
improve the discriminative formulation. Finally, modeling
listwise [20] losses may also help.